{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TensorFlow2教程-掩码和填充"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在构建深度学习模型，特别是进行序列的相关任务时，经常会用到掩码和填充技术"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/doit/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
      "  from ._conv import register_converters as _register_converters\n"
     ]
    }
   ],
   "source": [
    "from __future__ import absolute_import, division, print_function, unicode_literals\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "import tensorflow as tf\n",
    "\n",
    "from tensorflow.keras import layers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1 填充序列数据\n",
    "处理序列数据时，常常会遇到不同长度的文本，例如处理某些句子时：\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['The', 'weather', 'will', 'be', 'nice', 'tomorrow'],\n",
       " ['How', 'are', 'you', 'doing', 'today'],\n",
       " ['Hello', 'world', '!']]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[\n",
    "  [\"The\", \"weather\", \"will\", \"be\", \"nice\", \"tomorrow\"],\n",
    "  [\"How\", \"are\", \"you\", \"doing\", \"today\"],\n",
    "  [\"Hello\", \"world\", \"!\"]\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "计算机无法理解字符，我们一般将词语转换为对应的id"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[83, 91, 1, 645, 1253, 927], [73, 8, 3215, 55, 927], [71, 1331, 4231]]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[\n",
    "  [83, 91, 1, 645, 1253, 927],\n",
    "  [73, 8, 3215, 55, 927],\n",
    "  [71, 1331, 4231]\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由于深度信息模型的输入一般是固定的张量，所以我们需要对短文本进行填充（或对长文本进行截断），使每个样本的长度相同。\n",
    "Keras提供了一个API，可以通过填充和截断，获取等长的数据：\n",
    "tf.keras.preprocessing.sequence.pad_sequences"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[  83   91    1  645 1253  927]\n",
      " [   0   73    8 3215   55  927]\n",
      " [   0    0    0  711  632   71]]\n",
      "[[  83   91    1  645 1253  927]\n",
      " [  73    8 3215   55  927    0]\n",
      " [ 711  632   71    0    0    0]]\n"
     ]
    }
   ],
   "source": [
    "raw_inputs = [\n",
    "  [83, 91, 1, 645, 1253, 927],\n",
    "  [73, 8, 3215, 55, 927],\n",
    "  [711, 632, 71]\n",
    "]\n",
    "# 默认左填充\n",
    "padded_inputs = tf.keras.preprocessing.sequence.pad_sequences(raw_inputs)\n",
    "print(padded_inputs)\n",
    "# 右填充需要设 padding='post'\n",
    "padded_inputs = tf.keras.preprocessing.sequence.pad_sequences(raw_inputs,\n",
    "                                                             padding='post')\n",
    "print(padded_inputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2 掩码\n",
    "现在所有样本都获得了同样的长度，在求损失函数或输出的时候往往需要知道那些部分是填充的，需要忽略。这种忽略机制就是掩码。\n",
    "\n",
    "keras中三种添加掩码的方法：\n",
    "- 添加一个keras.layer.Masking的网络层\n",
    "- 配置keras.layers.Embedding层的mask_zero=True\n",
    "- 在支持mark的网络层中，手动传递参数\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1 掩码生成层：Embedding和Masking\n",
    "Embedding和Masking会创建一个掩码张量，附加到返回的输出中。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tf.Tensor(\n",
      "[[ True  True  True  True  True  True]\n",
      " [ True  True  True  True  True False]\n",
      " [ True  True  True False False False]], shape=(3, 6), dtype=bool)\n"
     ]
    }
   ],
   "source": [
    "embedding = layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)\n",
    "mask_output= embedding(padded_inputs)\n",
    "print(mask_output._keras_mask)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tf.Tensor(\n",
      "[[ True  True  True  True  True  True]\n",
      " [ True  True  True  True  True False]\n",
      " [ True  True  True False False False]], shape=(3, 6), dtype=bool)\n"
     ]
    }
   ],
   "source": [
    "# 使用掩码层\n",
    "masking_layer = layers.Masking()\n",
    "unmasked_embedding = tf.cast(\n",
    "tf.tile(tf.expand_dims(padded_inputs, axis=-1), [1, 1, 16]),\n",
    "tf.float32)\n",
    "masked_embedding = masking_layer(unmasked_embedding)\n",
    "print(masked_embedding._keras_mask)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "mark是带有2D布尔张量(batch_size, sequence_length)，其中每个单独的False指示在处理过程中应忽略相应的时间步数据。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 函数API和序列API中使用掩码\n",
    "\n",
    "使用函数式API或序列API时，由Embedding或Masking层生成的掩码将通过网络传播到能够使用它们的任何层（例如RNN层）。Keras将自动获取与输入相对应的掩码，并将其传递给知道如何使用该掩码的任何层。\n",
    "\n",
    "请注意，在子类化模型或图层的call方法中，掩码不会自动传播，因此将需要手动将mask参数传递给需要的图层。\n",
    "\n",
    "下面展示了LSTM层怎么自动接受掩码"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 序列式API\n",
    "model = tf.keras.Sequential([\n",
    "  layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True),\n",
    "  layers.LSTM(32),\n",
    "])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 函数式API\n",
    "inputs = tf.keras.Input(shape=(None,), dtype='int32')\n",
    "x = layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)(inputs)\n",
    "outputs = layers.LSTM(32)(x)\n",
    "\n",
    "model = tf.keras.Model(inputs, outputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 也可以直接在层间传递掩码参数\n",
    "\n",
    "可以处理掩码的图层（例如LSTM图层）在其图层中的__call__方法中有一个mask参数。\n",
    "\n",
    "同时，产生掩码的图层（例如Embedding）会公开compute_mask(input, previous_mask)，以获取掩码。\n",
    "\n",
    "因此，可以执行以下操作："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<tf.Tensor: id=4712, shape=(32, 32), dtype=float32, numpy=\n",
       "array([[ 0.00255451,  0.00382517, -0.01020782, ..., -0.00271396,\n",
       "        -0.00487344, -0.00039328],\n",
       "       [ 0.00373022,  0.00497113, -0.00294418, ...,  0.00122669,\n",
       "        -0.00351841,  0.00316206],\n",
       "       [ 0.00103166,  0.00395402, -0.00160181, ...,  0.00493634,\n",
       "        -0.0062202 , -0.00349879],\n",
       "       ...,\n",
       "       [-0.00031897, -0.00225923,  0.00185411, ...,  0.00372968,\n",
       "        -0.00407859,  0.00160439],\n",
       "       [ 0.00099225,  0.00042474, -0.00234293, ...,  0.00738544,\n",
       "        -0.00303122, -0.00415802],\n",
       "       [-0.0019643 ,  0.00317807,  0.00983389, ..., -0.00375287,\n",
       "        -0.00127929, -0.00616587]], dtype=float32)>"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "class MyLayer(layers.Layer):\n",
    "  \n",
    "    def __init__(self, **kwargs):\n",
    "        super(MyLayer, self).__init__(**kwargs)\n",
    "        self.embedding = layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)\n",
    "        self.lstm = layers.LSTM(32)\n",
    "\n",
    "    def call(self, inputs):\n",
    "        x = self.embedding(inputs)\n",
    "        # 也可手动构造掩码\n",
    "        mask = self.embedding.compute_mask(inputs)\n",
    "        output = self.lstm(x, mask=mask)  # The layer will ignore the masked values\n",
    "        return output\n",
    "\n",
    "layer = MyLayer()\n",
    "x = np.random.random((32, 10)) * 100\n",
    "x = x.astype('int32')\n",
    "layer(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 在自定义网络层中支持掩码\n",
    "有时，可能需要编写生成掩码的图层（例如Embedding），或需要修改当前掩码的图层。\n",
    "\n",
    "例如，产生张量的时间维度与其输入不同的任何层，都需要修改当前掩码，以便下游层能够适当考虑掩码的时间步长。\n",
    "\n",
    "为此，自建网络层应实现layer.compute_mask()方法，该方法将根据输入和当前掩码生成一个新的掩码。\n",
    "\n",
    "大多数图层都不会修改时间维度，因此无需担心掩盖。compute_mask()在这种情况下，默认行为是仅通过当前蒙版。\n",
    "\n",
    "#### TemporalSplit修改当前蒙版的图层的示例。\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tf.Tensor(\n",
      "[[ True  True  True]\n",
      " [ True  True  True]\n",
      " [ True  True  True]], shape=(3, 3), dtype=bool)\n",
      "tf.Tensor(\n",
      "[[ True  True  True]\n",
      " [ True  True False]\n",
      " [False False False]], shape=(3, 3), dtype=bool)\n"
     ]
    }
   ],
   "source": [
    "# 数据拆分时的掩码\n",
    "class TemporalSplit(tf.keras.layers.Layer):\n",
    "\n",
    "    def call(self, inputs):\n",
    "        # 将数据沿时间维度一分为二\n",
    "        return tf.split(inputs, 2, axis=1)\n",
    "\n",
    "    def compute_mask(self, inputs, mask=None):\n",
    "        # 将掩码也一分为二\n",
    "        if mask is None:\n",
    "            return None\n",
    "        return tf.split(mask, 2, axis=1)\n",
    "\n",
    "first_half, second_half = TemporalSplit()(masked_embedding)\n",
    "print(first_half._keras_mask)\n",
    "print(second_half._keras_mask)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "自己构建掩码，把为０的掩掉"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tf.Tensor(\n",
      "[[ True  True  True  True  True  True  True  True False  True]\n",
      " [ True  True  True  True  True  True  True False False False]\n",
      " [False  True  True  True  True  True  True  True  True  True]], shape=(3, 10), dtype=bool)\n"
     ]
    }
   ],
   "source": [
    "class CustomEmbedding(tf.keras.layers.Layer):\n",
    "  \n",
    "    def __init__(self, input_dim, output_dim, mask_zero=False, **kwargs):\n",
    "        super(CustomEmbedding, self).__init__(**kwargs)\n",
    "        self.input_dim = input_dim\n",
    "        self.output_dim = output_dim\n",
    "        self.mask_zero = mask_zero\n",
    "\n",
    "    def build(self, input_shape):\n",
    "        self.embeddings = self.add_weight(\n",
    "        shape=(self.input_dim, self.output_dim),\n",
    "        initializer='random_normal',\n",
    "        dtype='float32')\n",
    "\n",
    "    def call(self, inputs):\n",
    "        return tf.nn.embedding_lookup(self.embeddings, inputs)\n",
    "\n",
    "    def compute_mask(self, inputs, mask=None):\n",
    "        if not self.mask_zero:\n",
    "            return None\n",
    "        return tf.not_equal(inputs, 0)\n",
    "\n",
    "layer = CustomEmbedding(10, 32, mask_zero=True)\n",
    "x = np.random.random((3, 10)) * 9\n",
    "x = x.astype('int32')\n",
    "\n",
    "y = layer(x)\n",
    "mask = layer.compute_mask(x)\n",
    "\n",
    "print(mask)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "直接用网络层实现掩码功能：call函数中接受一个mask参数，并用它来确定是否跳过某些时间步。\n",
    "\n",
    "要编写这样的层，只需在call函数中添加一个mask=None参数即可。只要有可用的输入，与输入关联的掩码将被传递到图层。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "class MaskConsumer(tf.keras.layers.Layer):\n",
    "  \n",
    "    def call(self, inputs, mask=None):\n",
    "        pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
